Abstract

In this paper we tackle the problem of fast rates in time series forecasting from a statistical
learning perspective. In a series of papers (e.g. Meir (2000); Modha and Masry (1998);
Alquier and Wintenberger (2012)) it is shown that the main tools used in learning theory
with iid observations can be extended to the prediction of time series. The main message of
these papers is that, given a family of predictors, we are able to build a new predictor that
predicts the series as well as the best predictor in the family, up to a remainder of order
$1/\sqrt{n}$. It is known that this rate cannot be improved in general. In this paper, we show
that in the particular case of the least square loss, and under a strong assumption on the
time series ($\phi$-mixing), the remainder is actually of order $1/n$. Thus, the optimal rate for iid
variables, see e.g. Tsybakov (2003), and individual sequences, see Cesa-Bianchi and Lugosi
(2006), is, for the first time, achieved for uniformly mixing processes. We also show that
our method is optimal for aggregating sparse linear combinations of predictors.

Keywords: Statistical learning theory, time series prediction, PAC-Bayesian bounds, oracle
inequalities, fast rates, sparsity, mixing.

1. Introduction

The problem of time series forecasting is a standard problem in statistics. The parametric 
approach contains a wide range of models associated with efficient estimation and prediction 
methods, see e.g. Hamilton (1994); Brockwell and Davis (2009). 
In the last few years, several universal approaches emerged from various fields such
as non-parametric statistics, machine learning, computer science and game theory. These
approaches share some common features: the aim is to build a prediction procedure that
is able to predict the series as well as the best predictor in a given set of initial predictors,
say $\Theta$. The set of predictors is usually inspired by different parametric or non-parametric
statistical models. The true distribution of the data is not assumed to belong to one of
these models. However, we can distinguish two classes in these approaches, with different
quantification of the objective, and different terminologies:

• in the "prediction of individual sequences" approach, predictors are usually called
"experts". The objective is online prediction: at each date $t$, a prediction of the future
realization $x_{t+1}$ is based on the previous observations $x_1, \dots, x_t$, the objective being
to minimize the cumulative prediction loss. See for example Cesa-Bianchi and Lugosi
(2006); Stoltz (2010) for an introduction.
• in the statistical learning approach, the given predictors are sometimes referred to as
"models" or "concepts". The batch setting is more classical in statistics. A prediction
procedure is built on a complete sample $X_1, \dots, X_n$. The performance of the procedure
is compared on average with the best predictor, called the "oracle". The environment
is not deterministic and some hypotheses like mixing or weak dependence are required:
see Meir (2000); Modha and Masry (1998); Alquier and Wintenberger (2012).

In both settings, we are able to predict a bounded time series as well as the best expert,
up to a small remainder. This type of result is referred to in statistical theory as an oracle
inequality. In general, neglecting the size of the set of predictors $\Theta$, the remainder is
of the order $1/\sqrt{n}$ in both approaches: see, e.g., Cesa-Bianchi and Lugosi (2006) for the
"individual sequences" approach; for the "statistical learning approach" the rate $1/\sqrt{n}$ is
reached in Alquier and Wintenberger (2012). This paper is based on the following remark:
in the case of prediction of individual sequences, under a stronger assumption on the loss
function (satisfied e.g. by the quadratic loss), a fast rate $1/n$ can be reached. Note that
Meir (2000); Modha and Masry (1998) deal with the quadratic loss; their rate can be better
than $1/\sqrt{n}$ but cannot reach $1/n$. Here, we prove that the same result is true in the
statistical learning setting. Namely, under a $\phi$-mixing assumption introduced in Ibragimov
(1962), we are able to reach the fast rate in the batch setting for the quadratic loss.

Following Alquier and Wintenberger (2012), we will use tools from the PAC-Bayesian
theory to build our prediction procedure. Historically, the PAC-Bayesian point of view
emerged in statistical learning to deal with supervised classification (using the 0/1-loss),
see the seminal papers Shawe-Taylor and Williamson (1997); McAllester (1999). These
results were extended to general loss functions and more accurate bounds were then given, see
for example Catoni (2004, 2007); Alquier (2008); Dalalyan and Tsybakov (2008); Audibert
(2010); Alquier and Lounici (2011); Seldin et al. (2011); Gerchinovitz (2011). Interestingly
enough, PAC-Bayesian methods often lead to a prediction procedure that is an aggregation
of the various predictors in $\Theta$ with exponential weights, a standard procedure
in individual sequences prediction (introduced by Vovk (1990); Littlestone and Warmuth
(1994)). It is striking that this procedure receives theoretical justification from approaches
that have such different philosophies and objectives. This procedure has received various
names: EWA, for Exponentially Weighted Aggregate, in Dalalyan and Tsybakov (2008);
Gerchinovitz (2011), Gibbs estimator in Catoni (2004, 2007); Alquier (2008); Audibert
(2010), weighted majority algorithm in Littlestone and Warmuth (1994)... In Audibert
(2004), it is also proved that this estimator is simply the Bayesian estimator under a suitable
model and prior.

In Section 2 we introduce the notations used in the whole paper, in particular the time
series $(X_t)_{t\in\mathbb{Z}}$ and the set of predictors $\Theta$. Section 3 is devoted to the description of the
Gibbs estimator. Our main result is Theorem 1, stated in Section 4. In Section 5 we
provide examples of time series satisfying the main assumption of Theorem 1 ($\phi$-mixing). In
Section 6 we discuss the implementation of our procedure using MCMC methods and show
the results of some simulations. Finally, proofs are given in Section 8, with some technical
results postponed to the appendix. As we will see, the main tool needed to apply PAC-Bayesian
techniques is a control of the Laplace transform of the prevision risk. In the iid
setting, this might be done using the classical Hoeffding's or Bernstein's inequalities. In the
context of $\phi$-mixing, such a control is provided by a powerful result in Samson (2000).
Note that in this paper, we focus on the case where the set of predictors is the linear
span of a finite family of basic predictors. Theorem 1 will be of particular interest in the
case where a sparse combination of those basic predictors provides a good prediction. But
the results in this paper can be extended to other contexts (e.g. if we only want to predict
as well as the best basic predictor). The proof of Theorem 1 involves a general result,
Lemma 2, that can be adapted to these various contexts.

2. The context 

2.1. The observation 

We assume that we observe $(X_1, \dots, X_n)$ where $(X_t)_{t\in\mathbb{Z}}$ is a real, stationary process,
bounded by a constant $B$. We recall the $\phi$-mixing coefficients of the process $(X_t)$ as
introduced by Ibragimov (1962):

Definition 1 ($\phi$-mixing coefficients) We define the $\phi$-mixing coefficients of the process
$(X_t)_{t\in\mathbb{Z}}$ by

$$\phi_r = \sup_{(A,B)\in\mathfrak{S}_0\times\mathfrak{F}_r} \left|\pi(B|A) - \pi(B)\right|$$

where $\mathfrak{S}_0 = \sigma(X_t, t \le 0)$ and $\mathfrak{F}_r = \sigma(X_t, t \ge r)$. We also define:

$$K_\phi^n(q) := 1 + \sum_{r=1}^{n-q} \sqrt{\phi_{\lfloor r/q \rfloor}}.$$
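To make the quantity $K_\phi^n(q) = 1 + \sum_{r=1}^{n-q}\sqrt{\phi_{\lfloor r/q\rfloor}}$ concrete, here is a minimal Python sketch that evaluates it for a user-supplied sequence of mixing coefficients. The geometric family `geometric_phi` is a hypothetical example chosen for illustration, not a sequence taken from the paper; it only shows that exponentially decaying coefficients keep $K_\phi^n(q)$ bounded in $n$, which is what the existence of a constant $\Phi(q)$ requires.

```python
import math


def k_phi(phi, n, q):
    """Return 1 + sum over r = 1..n-q of sqrt(phi(floor(r / q)))."""
    return 1.0 + sum(math.sqrt(phi(r // q)) for r in range(1, n - q + 1))


def geometric_phi(r, c=1.0, rho=0.5):
    """Hypothetical coefficients phi_r = min(1, c * rho**r), for illustration only."""
    return min(1.0, c * rho ** r)


q = 2
values = [k_phi(geometric_phi, n, q) for n in (10, 100, 1000)]

# For this geometric family (c = 1, rho = 0.5) the partial sums stay below
# 1 + (q - 1) + q * sqrt(rho) / (1 - sqrt(rho)), so a finite Phi(q) exists.
bound = 1.0 + (q - 1) + q * math.sqrt(0.5) / (1.0 - math.sqrt(0.5))
```

In this toy setting, `bound` could play the role of the constant $\Phi(q)$ used in Theorem 1.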

2.2. Set of predictors 

We set a value $q$ and a family of functions: $g_1, \dots, g_p : [-B,B]^q \to [-B,B]$. The set of
predictors, for a given $b > 0$, is defined by:

$$\{f_\theta, \theta \in \Theta(b)\}$$

where $\Theta(b) = \{\theta \in \mathbb{R}^p : \|\theta\|_1 \le b\}$, and

$$f_\theta = \sum_{j=1}^{p} \theta_j g_j.$$

We also put $\Theta = \mathbb{R}^p$ and our objective is to find a $\theta$ such that $X_{q+1}$ is well predicted by
$f_\theta(X_q, \dots, X_1)$ on average under the stationary distribution.

Note that we will allow very large sets of predictors (experts, ...). Actually, we will
allow $n \ll p$. In this case, a sparsity assumption will be necessary: namely, it is possible
to build a good predictor $\theta$ such that most of its coordinates are close to 0. This
is now a classical assumption in statistical learning theory, see e.g. Tibshirani (1996);
Bühlmann and van de Geer (2011).

Example 1 (Auto-regressive predictors) A very classical example is to design predictors
based on auto-regressive models (AR). We put $p = q$ and $g_1(x_q, \dots, x_1) = x_q$, ...,
$g_q(x_q, \dots, x_1) = x_1$ so we obtain AR predictors

$$f_\theta(X_q, \dots, X_1) = \sum_{j=1}^{q} \theta_j X_{q+1-j}.$$

Note that in this case, $p < n$.



Example 2 We can extend the previous setting to non-linear AR predictors. For example,
we take $p = 2^q$ and $g_1(x_q, \dots, x_1) = \mathbf{1}(x_q > 0, \dots, x_1 > 0)$, then
$g_2(x_q, \dots, x_1) = \mathbf{1}(x_q > 0, \dots, x_2 > 0, x_1 \le 0)$, up to
$g_{2^q}(x_q, \dots, x_1) = \mathbf{1}(x_q \le 0, \dots, x_1 \le 0)$.
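The family of Example 2 can be enumerated mechanically: there is one indicator per sign pattern of the $q$ lagged values. The following Python sketch builds these $2^q$ functions; the name `sign_pattern_predictors` and the convention that a non-positive value counts as "negative" are assumptions made for illustration.

```python
from itertools import product


def sign_pattern_predictors(q):
    """Return the 2**q indicator functions, one per sign pattern of (x_q, ..., x_1)."""
    def make(pattern):
        # pattern[j] is True when coordinate j must be > 0, False when it must be <= 0.
        return lambda xs: 1.0 if all((x > 0) == pos for x, pos in zip(xs, pattern)) else 0.0

    return [make(pattern) for pattern in product((True, False), repeat=q)]


predictors = sign_pattern_predictors(3)
sample = (0.4, -1.2, 0.0)            # plays the role of (x_3, x_2, x_1)
active = [g(sample) for g in predictors]
```

Since the sign patterns partition the input space, exactly one entry of `active` equals 1 for any input, so $f_\theta$ is piecewise constant with one coefficient $\theta_j$ per pattern.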

Definition 2 (Prevision and empirical risks) We define the prevision risk

$$R(\theta) = \mathbb{E}_P\left[\left(X_{q+1} - f_\theta(X_q, \dots, X_1)\right)^2\right]$$

and the empirical risk

$$r(\theta) = \frac{1}{n-q}\sum_{i=q+1}^{n}\left[X_i - f_\theta(X_{i-1}, \dots, X_{i-q})\right]^2$$

and

$$\overline{\theta} \in \arg\min_{\Theta} R.$$

The objective is to build an estimator $\hat\theta$ based on the observations $(X_1, \dots, X_n)$ such
that $R(\hat\theta)$ is as small as possible. We will see in the next sections that the Gibbs estimator
reaches this objective.
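As an illustration of the empirical risk, the sketch below computes $r(\theta)$ for the auto-regressive predictors of Example 1, where $f_\theta(X_{i-1}, \dots, X_{i-q}) = \sum_{j=1}^{q}\theta_j X_{i-j}$. The function name `empirical_risk` and the toy series are illustrative choices, not part of the paper.

```python
def empirical_risk(theta, xs):
    """Mean squared one-step error of the AR predictor with coefficients theta.

    xs[0], ..., xs[n-1] stand for X_1, ..., X_n and q = len(theta).
    """
    q, n = len(theta), len(xs)
    total = 0.0
    for i in range(q, n):
        # theta[j] multiplies the observation j + 1 steps in the past.
        prediction = sum(theta[j] * xs[i - 1 - j] for j in range(q))
        total += (xs[i] - prediction) ** 2
    return total / (n - q)


toy_series = [1.0, 2.0, 4.0, 8.0, 16.0]
risk_exact = empirical_risk([2.0], toy_series)   # X_i = 2 X_{i-1} holds exactly
risk_zero = empirical_risk([0.0], toy_series)    # predicting 0 everywhere
```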

3. Description of the method 

The Gibbs estimator as defined in Catoni (2007) requires a prior distribution on the
parameter space.

Definition 3 (The prior) For $I \subset \{1, \dots, p\}$, $b > 0$,

$$\Theta_I(b) = \{\theta \in \Theta(b) : \forall i \notin I, \theta_i = 0\}$$

and

$$\Theta_I = \{\theta \in \Theta : \forall i \notin I, \theta_i = 0\}.$$

Finally, let us put $\pi_b^I$ the uniform probability measure on $\Theta_I(b+1)$. We put, for some $b > 0$,

$$\pi_b = \sum_{k=0}^{p} \sum_{\substack{I \subset \{1, \dots, p\} \\ |I| = k}} 2^{-(k+1)} \binom{p}{k}^{-1} \pi_b^I.$$

Remark that in order to predict as well as the best predictor in $\Theta(b)$, the prior
distribution has to be defined on $\Theta(b+1)$, for technical reasons that will become clear in the
proofs (see the appendix). We are now ready to give the definition of the Gibbs estimator.

Definition 4 (Gibbs estimator) We define, for any $b > 0$ and $\lambda > 0$, the probability
measure $\rho_{\lambda,b}$ such that

$$\frac{d\rho_{\lambda,b}}{d\pi_b}(\theta) = \frac{\exp[-\lambda r(\theta)]}{\int \exp[-\lambda r]\, d\pi_b},$$

and we put

$$\hat\theta_{\lambda,b} = \int_{\Theta(b)} \theta\, \rho_{\lambda,b}(d\theta). \qquad (1)$$

The parameter $\lambda$ is called the inverse temperature parameter. Its choice is a problem in
practice, see the discussions in Catoni (2003, 2004, 2007); Alquier (2008). In theory, we will
see that $\lambda$ of the order $n$ will lead to fast rates for prediction. In practice, $\lambda = n/\mathrm{var}(X)$
leads to satisfying results in our simulations, where $\mathrm{var}(X)$ is the empirical variance of
the observed time series. The practical computation of $\hat\theta_{\lambda,b}$ can also be a problem. In
Dalalyan and Tsybakov (2008) a Langevin Monte-Carlo algorithm is used. Here, as in
Alquier and Lounici (2011), the Reversible Jump MCMC of Green (1995) is used, see Section 6.
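The structure of the Gibbs estimator can be illustrated without any MCMC machinery. The sketch below is not the Reversible Jump algorithm used in the paper: it swaps in plain self-normalised importance sampling, drawing parameters from a sparsity-favouring prior and reweighting each draw by $\exp[-\lambda r(\theta)]$. It assumes AR predictors as in Example 1 and prior weights proportional to $2^{-(|I|+1)}\binom{p}{|I|}^{-1}$ on each support $I$, which is the form suggested by the Kullback computation in the proof of Theorem 1; all function names are hypothetical. This approach is only reasonable for small $p$, since the weights degenerate quickly when $\lambda$ is large.

```python
import math
import random


def empirical_risk(theta, xs):
    """Mean squared one-step error of the AR predictor with coefficients theta."""
    q, n = len(theta), len(xs)
    total = 0.0
    for i in range(q, n):
        prediction = sum(theta[j] * xs[i - 1 - j] for j in range(q))
        total += (xs[i] - prediction) ** 2
    return total / (n - q)


def sample_prior(p, b, rng):
    """Draw theta: support size k ~ 2^-(k+1), uniform support, uniform on the l1 ball."""
    sizes = range(p + 1)
    k = rng.choices(sizes, weights=[2.0 ** -(s + 1) for s in sizes])[0]
    theta = [0.0] * p
    if k == 0:
        return theta
    support = rng.sample(range(p), k)
    # Uniform draw in the l1 ball of radius b + 1: normalised exponentials give a
    # uniform point on the simplex, U**(1/k) the radial part, and signs are random.
    gaps = [rng.expovariate(1.0) for _ in range(k)]
    scale = (b + 1.0) * rng.random() ** (1.0 / k) / sum(gaps)
    for index, gap in zip(support, gaps):
        theta[index] = rng.choice((-1.0, 1.0)) * scale * gap
    return theta


def gibbs_estimator(xs, p, b, lam, n_draws, rng):
    """Approximate the mean of the Gibbs measure by reweighting prior draws."""
    draws = [sample_prior(p, b, rng) for _ in range(n_draws)]
    risks = [empirical_risk(theta, xs) for theta in draws]
    r_min = min(risks)  # subtracting the minimum keeps the exponentials stable
    weights = [math.exp(-lam * (r - r_min)) for r in risks]
    total = sum(weights)
    return [sum(w * th[j] for w, th in zip(weights, draws)) / total for j in range(p)]


rng = random.Random(0)
xs = [0.0]
for _ in range(200):
    xs.append(0.5 * xs[-1] + rng.uniform(-0.7, 0.7))   # toy AR(1) data
mean = sum(xs) / len(xs)
variance = sum((x - mean) ** 2 for x in xs) / len(xs)
lam = len(xs) / variance                               # heuristic lambda = n / var(X)
theta_hat = gibbs_estimator(xs, p=3, b=1.0, lam=lam, n_draws=5000, rng=rng)
```

Because `theta_hat` is a convex combination of points of the $\ell_1$ ball of radius $b+1$, it always lies in that ball, whatever the quality of the Monte Carlo approximation.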



4. Theoretical results 

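Theorem 1 (Oracle inequality for the Gibbs estimator) Assume that $\|\overline{\theta}\|_1 \le b$ and
that there exists a constant $\Phi(q)$ such that for any $n \in \mathbb{N}$, $\Phi(q) \ge K_\phi^n(q)$. Choose

$$\eta \in \left(0, \frac{16}{\Phi(q)}\right] \quad \text{and} \quad \lambda = \frac{\eta(n-q)}{64\Phi(q)(2+b)^2B^2}.$$

We have, with probability at least $1 - \epsilon$ on the drawing of the sample $(X_1, \dots, X_n)$,

$$R(\hat\theta_{\lambda,b}) - R(\overline{\theta}) \le \inf_{\substack{I \subset \{1, \dots, p\} \\ |I| \le \frac{\eta(n-q)}{32\Phi(q)(2+b)^2} \\ \theta \in \Theta_I(b)}} \left\{ \frac{2+\eta}{2-\eta}\left(R(\theta) - R(\overline{\theta})\right) + \frac{64\Phi(q)(2+b)^2B^2}{(n-q)\eta}\left[ |I|\left(B + 2\log\left(\frac{Bbpe}{|I|}\sqrt{\frac{2\eta(n-q)}{|I|}}\right)\right) + 2\log\left(\frac{2}{\epsilon}\right)\right]\right\}.$$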

The full proof is given in the appendix. In order to understand this result, it is
particularly useful to think of a particular case where there is a sparse optimal predictor: we
assume that there is a $\overline{\theta} \in \arg\min_{\Theta(b)} R$ that has only a small number $p_0$ of non-zero
coordinates. This is the classical "sparsity" assumption. Then in this case, taking $\theta = \overline{\theta}$ in the
previous result leads to

$$R(\hat\theta_{\lambda,b}) - R(\overline{\theta}) \le \frac{64\Phi(q)(2+b)^2B^2}{(n-q)\eta}\left[ p_0\left(B + 2\log\left(\frac{Bbpe}{p_0}\sqrt{\frac{2\eta(n-q)}{p_0}}\right)\right) + 2\log\left(\frac{2}{\epsilon}\right)\right] \qquad (2)$$

for $n$ large enough - actually, $n \ge q + p_0[32\Phi(q)(2+b)^2]/\eta$. We see that it is not the true
dimension $p$ of $\Theta$ that determines the rate (which would be $p/n$), but the intrinsic dimension $p_0$ of $\overline{\theta}$, as the rate
is $p_0\log(pn)/n$. With iid observations, Dalalyan and Tsybakov (2008); Alquier and Lounici
(2011) obtained the same result, with rate $p_0\log(p)/n$. In Gerchinovitz (2011), the same
rate is reached in the context of prediction of individual sequences.

Note that of course the strength of Theorem 1 when compared to Inequality (2) is that it
ensures that $\hat\theta_{\lambda,b}$ will give good predictions not only when $\overline{\theta}$ is sparse, but also when it can
only be approximated by a sparse parameter $\theta$.
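To get a feeling for the orders of magnitude, the following sketch compares the slow rate $1/\sqrt{n}$ with the sparse fast rate $p_0\log(pn)/n$, ignoring all constants. The values of $p_0$, $p$ and $n$ are arbitrary illustrative choices, not taken from the paper.

```python
import math


def slow_rate(n):
    """Order of the remainder in the general case: 1 / sqrt(n)."""
    return 1.0 / math.sqrt(n)


def sparse_fast_rate(p0, p, n):
    """Order of the remainder under sparsity: p0 * log(p * n) / n."""
    return p0 * math.log(p * n) / n


p0, p = 3, 100
table = [(n, slow_rate(n), sparse_fast_rate(p0, p, n)) for n in (100, 10_000, 1_000_000)]
```

With these values the fast rate is larger than the slow one for $n = 100$ and smaller from $n = 10\,000$ onwards, which is a reminder that the comparison of rates says nothing about the constants in front of them.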

Remark 1 The value of $\lambda$ proposed in the Theorem depends on the $\phi$-mixing coefficients
of the time series. Of course, these coefficients are unknown. One can check in the proof
of Theorem 1 that any $\lambda$ of the order of $n$ would lead to the same rate of convergence,
but with less precise constants. However, in practice, this does not tell us how to calibrate
$\lambda$. It is of course possible to use a procedure such as cross-validation. However, in
Dalalyan and Tsybakov (2008) or Alquier and Lounici (2011), it is observed that the value
$\lambda = n/(4\sigma^2)$ or $\lambda = n/(2\sigma^2)$, where $\sigma^2$ is the variance of the noise, performs well in
practice, and receives a theoretical justification in the iid setting. So we propose here the
heuristic value $\lambda = n/\mathrm{var}(X)$, where $\mathrm{var}(X)$ is
the empirical variance of the observed time series. We will see in Section 6 that it performs
well on a set of simulations.

5. Some examples of $\phi$-mixing processes

In this section we study the behavior of the prediction procedure on some classical $\phi$-mixing
processes. Throughout the section, $(\varepsilon_t)$ denotes an iid sequence called the innovations.

5.1. The AR(p) model

We consider the case where the observations $(X_t)$ satisfy an AR(p) model:

$$X_t = \sum_{j=1}^{p} a_j X_{t-j} + \varepsilon_t, \quad \forall t \in \mathbb{Z}. \qquad (3)$$

Here both $p \in \{1, 2, \dots\}$ and $(a_j)$ are unknown, $(\varepsilon_t)$ is bounded with a distribution
possessing an absolutely continuous component. Assume that $A(z) = \sum_{j=1}^{p} a_j z^j$ has no root
inside the unit disk in $\mathbb{C}$. Then there exists a stationary solution $(X_t)$ that is an exponentially
$\phi$-mixing process, in the sense that the coefficients $\phi_r$ decay exponentially fast, see
Athreya and Pantula (1986).

5.2. The MA(q) model

We consider now observations $(X_t)$ such that $X_t = \sum_{j=1}^{q} b_j \varepsilon_{t-j}$ for all $t \in \mathbb{Z}$. Assume that
$B(z) = \sum_{j=1}^{q} b_j z^j$ has no root inside the unit disk in $\mathbb{C}$ so that $(X_t)$ is invertible (admits
an AR($\infty$) representation). By definition the process $(X_t)$ is stationary and $\phi$-mixing -
it is even $q$-dependent, in the sense that $\phi_r = 0$ for $r > q$. Moreover it is bounded iff the
innovations are bounded. So this process satisfies the assumptions of Theorem 1.

5.3. Non linear models

Consider an extension of the AR(p) model of the form

$$X_t = F(X_{t-1}, \dots, X_{t-p}; \varepsilon_t), \quad \forall t \in \mathbb{Z}. \qquad (4)$$

To prepare the general case we recall some material from Meyn and Tweedie (1993).
Remember that the observations are assumed to belong to the compact set $[-B,B]$. The
Lagrange stability, irreducibility and aperiodicity conditions hold when the innovations
admit a lower semi-continuous density on $[-B,B]$ and for any $|x| \le B$ we have

$$[-B,B] = A_+(x) := \{F_k(x, w_1, \dots, w_k);\ k \ge 1,\ (w_1, \dots, w_k) \in \mathrm{Support}^k(\varepsilon)\}$$

with $F_k : \mathbb{R}^{k+1} \to \mathbb{R}$ defined recursively by the relation $F_{k+1}(\cdot, w) = F(F_k(\cdot), w)$, $F_1 = F$.
A direct application of Proposition 7.5 of Meyn and Tweedie (1993) yields that $(X_t)$ is
a T-chain (we refer to Meyn and Tweedie (1993) for the definition) if $F_1(x,w)$ is
continuously differentiable in $w$ and for each $x_0 \in \mathbb{R}^p$ there exists $(w_k)_{1\le k\le p}$ such that
$\partial F_k/\partial w_k(x_0, w_1, \dots, w_k) \neq 0$ for all $1 \le k \le p$. For example the generalized AR-GARCH
models of the form $F(x,w) = R(x) + \sigma(x)w$ with $R$ and $\sigma > 0$ continuously differentiable
are T-chains.

Assume that $(X_t)$ is an irreducible, aperiodic, Lagrange stable T-chain. Then it
satisfies the Doeblin condition and is thus exponentially $\phi$-mixing, see Theorem 16.2.7 of
Meyn and Tweedie (1993).

6. Implementation and simulations 

6.1. RJMCMC method 

The Gibbs estimator, given by (1), takes the form of an integral over a high-dimensional
space. It can thus be computed by Monte Carlo methods. This is actually a classical
approach for Bayesian estimators, see e.g. Marin and Robert (2007); Robert (1996). Here,
we use the RJMCMC algorithm - Reversible Jump Markov Chain Monte Carlo, Green
(1995). This method is implemented for example in Alquier and Lounici (2011) to compute
a Gibbs estimator that takes exactly the same form as ours.

6.2. Simulations study in the AR case 

We compare here the Gibbs estimator given by (1) to the "classical approach" in the AR 
case. This approach, for example as implemented in the R software (R Development Core Team 
(2008)), computes the least square estimator in each submodel AR(p) and then selects the 
order p by Akaike's AIC criterion Akaike (1973). 

Table 1: Performances of the Gibbs estimator, AIC and least square estimator in the full
model, on the simulations. Each simulation is repeated 20 times, we report on Table 1
the mean performance and standard deviation of each method. We highlight
the best result for each line.

n      Model   Innovations   Gibbs           AIC             Full Model
100    (5)     unif.         0.165 (0.022)   0.165 (0.023)   0.182 (0.029)
               Gaussian      0.167 (0.023)   0.161 (0.023)   0.173 (0.027)
       (6)     unif.         0.163 (0.020)   0.169 (0.022)   0.178 (0.022)
               Gaussian      0.172 (0.033)   0.179 (0.040)   0.201 (0.049)
       (7)     unif.         0.174 (0.022)   0.179 (0.028)   0.201 (0.040)
               Gaussian      0.179 (0.025)   0.182 (0.025)   0.202 (0.031)
1000   (5)     unif.         0.163 (0.005)   0.163 (0.005)   0.166 (0.005)
               Gaussian      0.160 (0.005)   0.160 (0.005)   0.162 (0.005)
       (6)     unif.         0.164 (0.004)   0.166 (0.004)   0.167 (0.004)
               Gaussian      0.160 (0.008)   0.161 (0.008)   0.163 (0.008)
       (7)     unif.         0.171 (0.005)   0.172 (0.006)   0.175 (0.006)
               Gaussian      0.173 (0.009)   0.173 (0.009)   0.176 (0.010)

We generate the data according to the following models:

$$X_t = 0.5X_{t-1} + 0.1X_{t-2} + \varepsilon_t \qquad (5)$$
$$X_t = 0.6X_{t-4} + 0.1X_{t-8} + \varepsilon_t \qquad (6)$$
$$X_t = \cos(X_{t-1})\sin(X_{t-2}) + \varepsilon_t \qquad (7)$$

where $\varepsilon_t$ is the innovation. We will use two models for the innovation: the uniform case,
$\varepsilon_t \sim \mathcal{U}[-a,a]$, and the Gaussian case, $\varepsilon_t \sim \mathcal{N}(0,\sigma^2)$. In the first case, the processes defined
in (5), (6) and (7) satisfy the assumptions of Theorem 1 (see Section 5) while the Gaussian
case is more classical in statistics, so it is worth testing if our method performs well in this
context too - even if our method does not receive any theoretical justification in this case,
as it is shown in Doukhan (1994) that autoregressive processes with Gaussian noise are not
$\phi$-mixing. We take $\sigma = 0.4$ and $a = 0.70$ (in both cases this leads to $\mathrm{Var}(\varepsilon_t) \simeq 0.16$).
The Gibbs estimator is used on all the possible AR models as in Example 1; we fix $q = 20$
and $\lambda = n/\mathrm{var}(X)$, where $\mathrm{var}(X)$ is the empirical variance of the observed time series. We
compare its performances to the ones of the AIC criterion as implemented in the R software
and to the basic least square estimator in the model AR(q) - that we will call "full model".
The experimental design is the following: for each model, we simulate a time series of length
$2n$, use the observations $1$ to $n$ as a learning set and $n+1$ to $2n$ as a test set. We report
the performances on the test set. We take $n = 100$ and $n = 1000$ in the simulations. Each
simulation is repeated 20 times, and we report in Table 1 the mean performance and standard
deviation of each method.
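The experimental design can be sketched in a few lines of Python. The example below simulates model (5) with uniform innovations on $[-0.70, 0.70]$, whose variance is $0.70^2/3 \approx 0.163$, and measures the mean squared one-step prediction error on the second half of the series. It evaluates the true coefficients $(0.5, 0.1)$ rather than a fitted estimator, so it only indicates the error level that a good procedure can approach; the burn-in length and function names are assumptions made for this illustration.

```python
import random


def simulate_model_5(n_total, a, rng, burn_in=200):
    """Simulate X_t = 0.5 X_{t-1} + 0.1 X_{t-2} + eps_t with eps_t uniform on [-a, a]."""
    xs = [0.0, 0.0]
    for _ in range(burn_in + n_total):
        xs.append(0.5 * xs[-1] + 0.1 * xs[-2] + rng.uniform(-a, a))
    return xs[-n_total:]   # drop the burn-in so the start is close to stationarity


def mean_squared_prediction_error(theta, xs, start, stop):
    """Average of (X_t - sum_j theta_j X_{t-j})^2 over t in [start, stop)."""
    q = len(theta)
    errors = []
    for t in range(max(start, q), stop):
        prediction = sum(theta[j] * xs[t - 1 - j] for j in range(q))
        errors.append((xs[t] - prediction) ** 2)
    return sum(errors) / len(errors)


rng = random.Random(0)
n = 1000
series = simulate_model_5(2 * n, 0.70, rng)
learning_set, test_set = series[:n], series[n:]
oracle_error = mean_squared_prediction_error([0.5, 0.1], series, n, 2 * n)
```

The value of `oracle_error` should be close to the innovation variance, which is consistent with the magnitudes reported in Table 1.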

It is interesting to note that our estimator performs better on Model (6) and Model (7)
while AIC performs slightly better on Model (5). The differences tend to be less perceptible
when $n$ grows - this is consistent with the fact that we develop here a non-asymptotic theory.
It is also interesting to note that our estimator seems to work well even in the case of
Gaussian noise.

7. Conclusion 

We proved that the Gibbs estimator can reach fast rates in the case of $\phi$-mixing time series.
It would now be interesting to extend this result to a more general class of processes, e.g.
weakly dependent ones. Note however that the versions of Bernstein's inequality known in the
context of weak dependence (see e.g. Dedecker et al. (2007); Wintenberger (2010)) do not,
to our knowledge, allow one to reach this rate. More generally, the question of concentration
of measure for time series is to a large extent still open.

Another question is to provide a theoretical justification for our heuristic for the tuning
of $\lambda$ in practice.



8. Proof of Theorem 1 



We start with a short overview of the proof. First, we state a result, Lemma 1, that provides
a control of the difference between the risk and the empirical risk of a predictor. The main
tool for the proof of this result is Samson's version of Bernstein's inequality, Lemma 3,
that we recall in the appendix. Lemma 1 is then used together with the Donsker-Varadhan
variational formula (also recalled in the appendix, Lemma 4) to prove a PAC-Bayesian
type oracle inequality similar to the ones in Catoni (2004), Lemma 2, that is the main tool
used to prove Theorem 1.

Lemma 1 Under the hypothesis of Theorem 1, we have, for any $\theta \in \Theta(b+1)$, for any
$0 \le \lambda \le (n-q)/[4(2+b)^2B^2\Phi^2(q)]$,

$$\mathbb{E}\exp\left\{\lambda\left[\left(1 - \frac{32\Phi(q)\lambda(2+b)^2B^2}{n-q}\right)\left(R(\theta) - R(\overline{\theta})\right) - r(\theta) + r(\overline{\theta})\right]\right\} \le 1,$$

and

$$\mathbb{E}\exp\left\{\lambda\left[\left(1 + \frac{32\Phi(q)\lambda(2+b)^2B^2}{n-q}\right)\left(R(\overline{\theta}) - R(\theta)\right) - r(\overline{\theta}) + r(\theta)\right]\right\} \le 1.$$



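Proof [Proof of Lemma 1] We apply Samson's version of Bernstein's inequality (see Lemma
3 in the Appendix) to $N = n - q$, $Z_i = (X_{i+1}, \dots, X_{i+q})$,

$$f(Z_i) = \frac{1}{n-q}\left[R(\theta) - R(\overline{\theta}) - \left(X_{i+q} - f_\theta(X_{i+q-1}, \dots, X_{i+1})\right)^2 + \left(X_{i+q} - f_{\overline{\theta}}(X_{i+q-1}, \dots, X_{i+1})\right)^2\right].$$

Note that we have:

$$S(f) = \left[R(\theta) - R(\overline{\theta}) - r(\theta) + r(\overline{\theta})\right],$$

and the $Z_i$ are uniformly mixing with coefficients $\phi_r^Z = \phi_{\lfloor r/q \rfloor}$. Note that
$K_{\phi^Z} = 1 + \sum_{r=1}^{n-q}\sqrt{\phi_{\lfloor r/q \rfloor}} = K_\phi^n(q) \le \Phi(q)$. For any $\theta$ and $\theta'$ in $\Theta$ let us put

$$V(\theta, \theta') = \mathbb{E}_P\left\{\left[\left(X_{q+1} - f_\theta(X_q, \dots, X_1)\right)^2 - \left(X_{q+1} - f_{\theta'}(X_q, \dots, X_1)\right)^2\right]^2\right\}.$$

Noticing that $\sigma^2(f) \le V(\theta, \overline{\theta})/(n-q)^2$ and that $|f| \le 4(2+b)^2B^2/(n-q)$, for any
$0 \le \lambda \le (n-q)/[4(2+b)^2B^2\Phi^2(q)]$, we have

$$\ln \mathbb{E}_P \exp\left\{\lambda\left[R(\theta) - R(\overline{\theta}) - r(\theta) + r(\overline{\theta})\right]\right\} \le \frac{8\Phi(q)\lambda^2 V(\theta, \overline{\theta})}{n-q}.$$

Notice also that

$$V(\theta, \overline{\theta}) = \mathbb{E}_P\left\{\left[2X_{q+1} - (f_\theta + f_{\overline{\theta}})(X_q, \dots, X_1)\right]^2\left[(f_\theta - f_{\overline{\theta}})(X_q, \dots, X_1)\right]^2\right\}$$
$$\le (2 + \|\theta\|_1 + \|\overline{\theta}\|_1)^2B^2\, \mathbb{E}_P\left\{\left[(f_\theta - f_{\overline{\theta}})(X_q, \dots, X_1)\right]^2\right\}$$
$$= (2 + \|\theta\|_1 + \|\overline{\theta}\|_1)^2B^2\left[R(\theta) - R(\overline{\theta})\right] \le 4(2+b)^2B^2\left[R(\theta) - R(\overline{\theta})\right]$$

as $\theta \in \Theta(b+1)$ and $\overline{\theta} \in \Theta(b) \subset \Theta(b+1)$. This proves the first inequality of Lemma 1. The
second inequality is proved exactly in the same way, but replacing $f$ by $-f$. ■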



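We are now ready to state the following key result. Note that the very classical definition
of the Kullback divergence $\mathcal{K}(\rho, \pi)$ is recalled in the appendix.

Lemma 2 (PAC-Bayesian oracle inequality for a $\phi$-mixing process) Under the
hypothesis of Theorem 1, we have, for any $0 \le \lambda \le (n-q)/[4(2+b)^2B^2\Phi^2(q)]$, for any
$0 < \epsilon < 1$,

$$\mathbb{P}\left\{\begin{array}{l} \forall \rho \in \mathcal{M}_+^1(\Theta(b+1)), \\[4pt] \left(1 - \frac{32\Phi(q)\lambda(2+b)^2B^2}{n-q}\right)\left(\int R\,d\rho - R(\overline{\theta})\right) \le \int r\,d\rho - r(\overline{\theta}) + \frac{\mathcal{K}(\rho,\pi_b) + \log(2/\epsilon)}{\lambda} \\[4pt] \text{and} \\[4pt] \int r\,d\rho - r(\overline{\theta}) \le \left(\int R\,d\rho - R(\overline{\theta})\right)\left(1 + \frac{32\Phi(q)\lambda(2+b)^2B^2}{n-q}\right) + \frac{\mathcal{K}(\rho,\pi_b) + \log(2/\epsilon)}{\lambda} \end{array}\right\} \ge 1 - \epsilon.$$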

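Proof [Proof of Lemma 2] Let us fix $\epsilon$, $\lambda$ and $\theta \in \Theta(b+1)$, and apply the first inequality
of Lemma 1. We have:

$$\mathbb{E}\exp\left\{\lambda\left[\left(1 - \frac{32\Phi(q)\lambda(2+b)^2B^2}{n-q}\right)\left(R(\theta) - R(\overline{\theta})\right) - r(\theta) + r(\overline{\theta})\right]\right\} \le 1,$$

and we multiply this result by $\epsilon/2$ and integrate it with respect to $\pi_b(d\theta)$. Fubini's Theorem
gives:

$$\mathbb{E}\int\exp\left\{\lambda\left[\left(1 - \frac{32\Phi(q)\lambda(2+b)^2B^2}{n-q}\right)\left(R(\theta) - R(\overline{\theta})\right) - r(\theta) + r(\overline{\theta})\right] + \log\left(\frac{\epsilon}{2}\right)\right\}\pi_b(d\theta) \le \frac{\epsilon}{2}.$$

We apply the Donsker-Varadhan variational formula (see Lemma 4 in the appendix) and we
get:

$$\mathbb{E}\exp\left\{\sup_\rho\left(\lambda\left[\left(1 - \frac{32\Phi(q)\lambda(2+b)^2B^2}{n-q}\right)\left(\int R\,d\rho - R(\overline{\theta})\right) - \int r\,d\rho + r(\overline{\theta})\right] + \log\left(\frac{\epsilon}{2}\right) - \mathcal{K}(\rho,\pi_b)\right)\right\} \le \frac{\epsilon}{2}.$$

As $e^x \ge \mathbf{1}_{\mathbb{R}_+}(x)$, we have:

$$\mathbb{P}\left\{\sup_\rho\left(\lambda\left[\left(1 - \frac{32\Phi(q)\lambda(2+b)^2B^2}{n-q}\right)\left(\int R\,d\rho - R(\overline{\theta})\right) - \int r\,d\rho + r(\overline{\theta})\right] + \log\left(\frac{\epsilon}{2}\right) - \mathcal{K}(\rho,\pi_b)\right) \ge 0\right\} \le \frac{\epsilon}{2}.$$

Now, we follow the same proof again but starting with the second inequality of Lemma 1.
We obtain:

$$\mathbb{P}\left\{\sup_\rho\left(\lambda\left[\left(1 + \frac{32\Phi(q)\lambda(2+b)^2B^2}{n-q}\right)\left(R(\overline{\theta}) - \int R\,d\rho\right) - r(\overline{\theta}) + \int r\,d\rho\right] + \log\left(\frac{\epsilon}{2}\right) - \mathcal{K}(\rho,\pi_b)\right) \ge 0\right\} \le \frac{\epsilon}{2}.$$

A union bound ends the proof. ■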



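We are now ready to give the proof of Theorem 1.
Proof First, we apply Lemma 2. From now on, we work on the event of probability at least
$1 - \epsilon$ given by this lemma. In particular we have, $\forall \rho \in \mathcal{M}_+^1(\Theta(b+1))$,

$$\int R\,d\rho - R(\overline{\theta}) \le \frac{\int r\,d\rho - r(\overline{\theta}) + \frac{\mathcal{K}(\rho,\pi_b) + \log(2/\epsilon)}{\lambda}}{1 - \frac{32\Phi(q)\lambda(2+b)^2B^2}{n-q}}.$$

For the sake of simplicity, during this proof, we will use the following notation:

$$C = 32\Phi(q)(2+b)^2B^2.$$

Taking $\rho = \rho_{\lambda,b}$ leads to:

$$\int R\,d\rho_{\lambda,b} - R(\overline{\theta}) \le \frac{\int r\,d\rho_{\lambda,b} - r(\overline{\theta}) + \frac{\mathcal{K}(\rho_{\lambda,b},\pi_b) + \log(2/\epsilon)}{\lambda}}{1 - \frac{\lambda C}{n-q}}.$$

We apply Lemma 4 to see that:

$$\int R\,d\rho_{\lambda,b} - R(\overline{\theta}) \le \inf_\rho \frac{\int r\,d\rho - r(\overline{\theta}) + \frac{\mathcal{K}(\rho,\pi_b) + \log(2/\epsilon)}{\lambda}}{1 - \frac{\lambda C}{n-q}}.$$

Now, we use the second inequality of Lemma 2 to see that

$$\int R\,d\rho_{\lambda,b} - R(\overline{\theta}) \le \inf_\rho \frac{\left(1 + \frac{\lambda C}{n-q}\right)\left(\int R\,d\rho - R(\overline{\theta})\right) + 2\,\frac{\mathcal{K}(\rho,\pi_b) + \log(2/\epsilon)}{\lambda}}{1 - \frac{\lambda C}{n-q}}$$
$$\le \inf_{I \subset \{1, \dots, p\}}\ \inf_{\rho \ll \pi_b^I} \frac{\left(1 + \frac{\lambda C}{n-q}\right)\left(\int R\,d\rho - R(\overline{\theta})\right) + 2\,\frac{\mathcal{K}(\rho,\pi_b) + \log(2/\epsilon)}{\lambda}}{1 - \frac{\lambda C}{n-q}}. \qquad (8)$$

By Jensen's inequality,

$$\int R\,d\rho_{\lambda,b} \ge R(\hat\theta_{\lambda,b}).$$

Also remark that, as soon as $\rho \ll \pi_b^I$,

$$\mathcal{K}(\rho, \pi_b) = (|I| + 1)\log(2) + \log\binom{p}{|I|} + \mathcal{K}(\rho, \pi_b^I)$$
$$\le (|I| + 1)\log(2) + |I|\log\left(\frac{pe}{|I|}\right) + \mathcal{K}(\rho, \pi_b^I)$$

(see, e.g., Catoni (2003) page 190). Now, for any $0 < \delta \le 1$, for any $I \subset \{1, \dots, p\}$, and
$\theta \in \Theta_I(b)$, we take $\rho_{I,\theta,\delta}$ as the uniform measure on $\{t \in \Theta_I(b) : \|t - \theta\|_1 \le \delta\}$. Note that
as $\theta \in \Theta_I(b)$ and $\delta \le 1$, the support of $\rho_{I,\theta,\delta}$ is included in $\Theta(b+1)$, the support of $\pi_b$. This
is the reason why $\pi_b$ is defined in this way. Inequality (8) leads to

$$R(\hat\theta_{\lambda,b}) - R(\overline{\theta}) \le \frac{1}{1 - \frac{\lambda C}{n-q}}\ \inf_{\delta > 0}\ \inf_{I \subset \{1, \dots, p\}}\ \inf_{\theta \in \Theta_I(b)}\left\{\left(1 + \frac{\lambda C}{n-q}\right)\left(R(\theta) + B^2\delta^2 - R(\overline{\theta})\right) + 2\,\frac{(|I| + 1)\log(2) + |I|\log\left(\frac{pe}{|I|}\right) + |I|\log\left(\frac{b}{\delta}\right) + \log\left(\frac{2}{\epsilon}\right)}{\lambda}\right\}$$

and so, by choosing $\delta = \sqrt{|I|/(2B^2\lambda)}$, we get:

$$R(\hat\theta_{\lambda,b}) - R(\overline{\theta}) \le \frac{1}{1 - \frac{\lambda C}{n-q}}\ \inf_{I \subset \{1, \dots, p\}}\ \inf_{\theta \in \Theta_I(b)}\left\{\left(1 + \frac{\lambda C}{n-q}\right)\left(R(\theta) - R(\overline{\theta})\right) + \frac{|I|\left[B + 2\log\left(2\,\frac{Bbpe}{|I|}\sqrt{\frac{2\lambda}{|I|}}\right)\right] + 2\log\left(\frac{2}{\epsilon}\right)}{\lambda}\right\}.$$

Remember that $\lambda \le (n-q)c$ where we put for short $c = 1/[4(2+b)^2B^2\Phi^2(q)]$. Let us take
$\lambda = \eta(n-q)/(2C)$ for some constant $\eta$. Remark that $\eta \le 2cC$ ensures that $\lambda \le (n-q)c$
while we need to impose $|I| \le \eta B^2(n-q)/C$ in order to ensure that $\delta \le 1$. We obtain:

$$\mathbb{P}\left\{R(\hat\theta_{\lambda,b}) - R(\overline{\theta}) \le \inf_{\substack{I \subset \{1, \dots, p\} \\ |I| \le \eta B^2(n-q)/C \\ \theta \in \Theta_I(b)}}\left[\frac{2+\eta}{2-\eta}\left(R(\theta) - R(\overline{\theta})\right) + \frac{2C}{(n-q)\eta}\left(|I|\left[B + 2\log\left(\frac{Bbpe}{|I|}\sqrt{\frac{2\eta(n-q)}{|I|}}\right)\right] + 2\log\left(\frac{2}{\epsilon}\right)\right)\right]\right\} \ge 1 - \epsilon.$$

We end the computation by the remark that $\lambda = \eta(n-q)/(2C) = \eta(n-q)/[64\Phi(q)(2+b)^2B^2]$
and that $\eta \le 2cC = 16/\Phi(q)$. ■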
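The constants appearing at the end of the proof can be checked by direct substitution. Using only the definitions $C = 32\Phi(q)(2+b)^2B^2$, $c = 1/[4(2+b)^2B^2\Phi^2(q)]$ and $\lambda = \eta(n-q)/(2C)$ given above, a short worked computation reads:

$$2cC = \frac{2 \cdot 32\,\Phi(q)(2+b)^2B^2}{4(2+b)^2B^2\Phi^2(q)} = \frac{16}{\Phi(q)}, \qquad \lambda = \frac{\eta(n-q)}{2C} \le \frac{2cC\,(n-q)}{2C} = c\,(n-q) \quad \text{whenever } \eta \le 2cC,$$

$$\frac{\lambda C}{n-q} = \frac{\eta}{2}, \qquad \frac{1 + \frac{\lambda C}{n-q}}{1 - \frac{\lambda C}{n-q}} = \frac{2+\eta}{2-\eta}, \qquad \delta^2 = \frac{|I|}{2B^2\lambda} \le 1 \iff |I| \le 2B^2\lambda = \frac{\eta B^2(n-q)}{C}.$$

The second line makes explicit where the factor in front of $R(\theta) - R(\overline{\theta})$ and the constraint on $|I|$ come from; this factor is only meaningful for $\eta < 2$.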
Appendix A. Samson's version of Bernstein's inequality and
Donsker-Varadhan variational formula

Lemma 3 (Samson (2000) (page 460, line 7)) Let $N \in \mathbb{N}$. Let $(Z_i)_{i\in\mathbb{Z}}$ be a stationary
process, let $(\phi_r^Z)$ denote its $\phi$-mixing coefficients, let $f$ be a measurable function $\mathbb{R} \to
[-M, M]$ and let

$$S_N(f) := \sum_{i=1}^{N} f(Z_i).$$

Then:

$$\ln \mathbb{E}\left(\exp\left(\lambda\left(S_N(f) - \mathbb{E}S_N(f)\right)\right)\right) \le 8K_{\phi^Z}N\sigma^2(f)\lambda^2, \quad \text{for all } 0 \le \lambda \le 1/(MK_{\phi^Z}^2),$$

where $K_{\phi^Z} = 1 + \sum_{r=1}^{N}\sqrt{\phi_r^Z}$ and $\sigma^2(f) = \mathrm{Var}[f(Z_i)]$.

Definition 5 Given a measurable space $(E, \mathcal{E})$ we let $\mathcal{M}_+^1(E)$ denote the set of all probability
measures on $(E, \mathcal{E})$. The Kullback divergence is a pseudo-distance on $\mathcal{M}_+^1(E)$ defined,
for any $(\pi, \pi') \in [\mathcal{M}_+^1(E)]^2$, by the equation

$$\mathcal{K}(\pi, \pi') = \begin{cases} \pi[\log(d\pi/d\pi')] & \text{if } \pi \ll \pi', \\ +\infty & \text{otherwise,} \end{cases}$$

with the convention that $\pi[h] = \int h(x)\pi(dx)$ for any measurable function $h$.

Lemma 4 (Donsker and Varadhan (1976) variational formula) For any $\pi$ in the set
$\mathcal{M}_+^1(E)$, for any measurable function $h : E \to \mathbb{R}$ such that $\pi[\exp(h)] < +\infty$ we have:

$$\pi[\exp(h)] = \exp\left(\sup_{\rho \in \mathcal{M}_+^1(E)}\left(\rho[h] - \mathcal{K}(\rho, \pi)\right)\right), \qquad (9)$$

with convention $\infty - \infty = -\infty$. Moreover, as soon as $h$ is upper-bounded on the support of
$\pi$, the supremum with respect to $\rho$ in the right-hand side is reached for the Gibbs measure
$\pi\{h\}$ defined by

$$\pi\{h\}(dx) = \frac{e^{h(x)}\pi(dx)}{\pi[\exp(h)]}.$$
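The variational formula is easy to check numerically on a finite space, where measures are probability vectors. The sketch below compares $\log \pi[\exp(h)]$ with the value of $\rho[h] - \mathcal{K}(\rho, \pi)$ at the Gibbs measure and at another, arbitrary, probability vector. The three-point space and the values of $\pi$ and $h$ are illustrative choices.

```python
import math


def log_laplace(pi, h):
    """log of pi[exp(h)] on a finite space."""
    return math.log(sum(p * math.exp(x) for p, x in zip(pi, h)))


def kullback(rho, pi):
    """Kullback divergence K(rho, pi), assuming rho is absolutely continuous w.r.t. pi."""
    return sum(r * math.log(r / p) for r, p in zip(rho, pi) if r > 0)


def objective(rho, pi, h):
    """The quantity rho[h] - K(rho, pi) maximised in the variational formula."""
    return sum(r * x for r, x in zip(rho, h)) - kullback(rho, pi)


def gibbs_measure(pi, h):
    """pi{h}(x) = exp(h(x)) pi(x) / pi[exp(h)]."""
    normaliser = sum(p * math.exp(x) for p, x in zip(pi, h))
    return [p * math.exp(x) / normaliser for p, x in zip(pi, h)]


pi = [0.5, 0.3, 0.2]
h = [1.0, -0.5, 2.0]
lhs = log_laplace(pi, h)
value_at_gibbs = objective(gibbs_measure(pi, h), pi, h)
value_elsewhere = objective([1 / 3, 1 / 3, 1 / 3], pi, h)
```

Here `value_at_gibbs` coincides with `lhs` up to rounding, while `value_elsewhere` is smaller, as the lemma predicts.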